Chris Pollett > Old Classes > CS267
( Print View )

Student Corner:
  [Grades Sec2]

  [Submit Sec2]

  [Class Sign Up Sec2]

  [Lecture Notes]
  [Discussion Board]

Course Info:
  [Texts & Links]
  [Topics/Outcomes]
  [Outcomes Matrix]
  [Grading]
  [HW/Quiz Info]
  [Exam Info]
  [Regrades]
  [Honesty]
  [Additional Policies]
  [Announcements]

HW Assignments:
  [Hw1] [Hw2] [Hw3]
  [Hw4] [Quizzes]

Practice Exams:
  [Mid 1] [Mid 2] [Final]

HW#1 --- last modified February 07 2019 23:03:03..

Solution set.

Due date: Sep 12

Files to be submitted:
Hw1.zip

Purpose:To dust off our programming skills and start working on coding an inverted index. To gain experience using an open source crawler. To practice evaluating search results.

Related Course Outcomes:

The main course outcomes covered by this assignment are:

LO1 - Code a basic inverted index capable of performing conjunctive queries.

LO6 - Be able to evaluate search results by hand and using TREC eval software.

Specification:

This homework consists of three parts: A set of book problems, which are to be done individually, should be handwritten (no typing), and must be turned in at the start of class on the day indicated; some experiment with an open source search engine; and a coding assignment. The latter two parts of this assignment can be done in a team of at most three people. For this portion only one team member needs to submit -- just be sure to mention the names of your team mates in the documentation for your code.

The book problems I want you to do are: Exercise 1.3 and 1.4 (due Aug. 29); Exercise 1.6 and 1.7 (due Sep. 5); Exercise 2.1 and 2.3 (due Sep 12).

For Part 2, I want you to download Yioop! or Nutch and have it crawl the site: mirror.cs.sjsu.edu. Perform a query involving a high frequency word (for example, "the"), a medium frequency word, and a low frequency. Look at the source code of the search engine and try to come up with a short explanation of why it ranked the results as it did. In the file Hw1Part2.txt describe how you configured Yioop! or Nutch. Then have your explanation, and finally, include the results of your queries.

For Part 3, I want you to code a simple inverted index printer. It will be run from the command line with a line like:

aux_program IndexPrinter filename

Here aux_program is at most one of java, php, python (your choice). If you choose to do this project in C or C++, aux_program is optional. Your program will be compiled with the compiler in Mac OSX Mountain Lion, so try not to use anything fancy. (Java 1.6 for Java, LLVM for C/C++, PHP 5.3.13, Python 2.7). filename is the name of some txt document. This document is assumed to contain only ASCII characters with '\n' used for line endings. For this assignment, the sequence '\n\n' indicates a new "document" within this file. A "term" is any sequence of characters that does not include a white-space character. Once run, your code should read in this file into a data structure of your choice in memory. We assume for this homework the file is small enough to be read into memory. Your program should sort all terms seen in any document. For each term in this sorted order, your program should output to stdout, the term on a line by itself. On the next line it should output the number of documents the term appeared in, comma, the number of occurrences of the term across all documents, comma, and then a list a sequence of pairs in the form (doc_num, # of occurrence within this doc). Here a given pair appears if and only if that document contained the term at least once. A snippet of this table might look like:

aadvark
2,4,(5,1),(8,3)
aaron
1,1,(2,1)
...

In the above (5,1) indicates the term aadvark occurred once in the fifth document in the file (counting from 0). Once done outputting this data your program should stop. That is all there is to Part 3.

Point Breakdown

Part 1 (each problem 1pt - 1 fully correct, 0.5 any defect or incomplete, 0 didn't do or was wrong)	3pts
Part 2 (1pt configuration description, 1pt test query outputs seem believable, 1 explanation of how query works).	3pts
Part 3 (1pt Code well documented and follows departmental coding guidelines for language in question or PEAR guidelines for PHP, 1pt reads the documents in the file into a reasonable data structure, 1pts seems to be calculating the desired statistics, 1pt output as describe above).	4pts
Total	10pts
Total	10pts